Abstract. We discuss the paper "Citation Statistics" by the Joint
Committee on Quantitative Assessment of Research. In particular, we
focus on a necessary feature of "good" measures for ranking scientific
authors: that good measures must be able to accurately distinguish
between authors.
1. INTRODUCTION

The Joint Committee on Quantitative Assessment 
of Research (the Committee) has written a highly 
readable and well argued report discussing common 
misuses of citation data. The Committee argues con- 
vincingly that even the meaning of the "atom" of ci- 
tation analysis, the citation of a single paper, is non- 
trivial and not easily converted to a measure of re- 
search quality. The Committee also emphasizes that 
the assessment of research based on citation statis- 
tics always reduces to the creation of ranked lists 
of papers, people, journals, etc. In order to create 
such a ranking of scientific authors, it is necessary to
reduce each author's full publication and citation
record to a single scalar measure, M. It is obvious
that any choice of M that is independent of the
citation record (e.g., the number of papers published
per year) is likely to be a poor measure of research 
quality. However, it is less clear what constitutes a 
"good" measure of an author's full citation record.
We have previously discussed this question in some 
detail [1, 2], but in the light of the present report, the 
subject appears to merit further discussion. Below, 
we elaborate on the definition of the term "good" in 
the context of ranking scientific authors by describ- 
ing how to assign objective (i.e., purely statistical) 
uncertainties to any given choice of M. 

2. IMPROBABLE AUTHORS 

It is possible to divide the question of what con- 
stitutes a "good" scalar measure of author perfor- 
mance into two components. One aspect is wholly 
subjective and not amenable to quantitative investi- 
gation. We illustrate this with an example. Consider 
two authors, A and B, who have written 10 papers 
each. Author A has written 10 papers with 100 cita- 
tions each and author B has written one paper with 
1000 citations and 9 papers with 0 citations. First,
we consider an argument for concluding that author 
A is the "better" of the two. 

In spite of varying citation habits in different fields 
of science, the distribution of citations within each 
field is a highly skewed power-law type distribution 
(see, e.g., [3, 4]). Because of the power-law structure
of citation distributions, the citation record of au- 
thor A is more improbable than that of author B. It 
is illuminating to quantify the difference between the 
two authors using a real dataset. Here, we use data 
from the SPIRES database for high energy physics 
(see [3] for details regarding this dataset). The ci- 
tation summary option on the SPIRES website re- 
turns the number of papers for a given author with 
citations in each of six intervals. These intervals and 
the probabilities that papers will fall in these bins 
are given in Table 1.

Table 1. The search option citation summary at the SPIRES
website returns the number of papers for a given author with
citations in the intervals shown. The probabilities of getting
citations in these intervals are listed in the third column.

Paper category       Citations   Probability P(i)
Unknown papers       0           0.267
Less known papers    1-9         0.444
Known papers         10-49       0.224
Well-known papers    50-99       0.0380
Famous papers        100-499     0.0250
Renowned papers      500+        0.00184

The probability, P({n_i}), that an author's actual citation
record of N papers was obtained from a random draw on the
citation distribution P(i) is readily calculated by
multiplying the probabilities of drawing the author's number
of publications in the different categories, n_i, and
correcting for the number of permutations.

$$P(\{n_i\}) = N! \prod_i \frac{P(i)^{n_i}}{n_i!}.$$

If a total of N papers is drawn at random on
the citation distribution, the most probable result,
P({n_i}_max), corresponds to n_i = N P(i) papers in
each bin. The quantity

$$r = -\log_{10}\left[\frac{P(\{n_i\})}{P(\{n_i\}_{\max})}\right]$$

is a useful measure of this probability which is
relatively independent of the number of bins chosen.
In the case of author A we find the value r = 14.4,
and for author B we find r = 5.33. In spite of the
fact that A and B have the same average number
of citations, the record of author B is roughly 10^9
times more probable than that of author A! We do
not claim that the "unlikelihood" measure r
captures the richness of the full data set nor that it
captures the complexity of individual citation records,¹
but we do claim that the extreme improbability of
author A might convince some to choose her over
author B.
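
The unlikelihood r for the two example authors can be checked with a
short script. This is only a sketch under stated assumptions: it uses
the rounded probabilities of Table 1, takes author B's nine other
papers to be uncited (which is what the equal averages imply), and
replaces n! by Gamma(n + 1) so that the non-integer counts
n_i = N P(i) of the most probable record can be handled. The function
names are ours, chosen for illustration.

```python
from math import lgamma, log

# Bin probabilities P(i) from Table 1, ordered from 0 citations to 500+.
P = [0.267, 0.444, 0.224, 0.0380, 0.0250, 0.00184]

def log10_multinomial(counts, probs):
    """log10 of N! * prod_i probs[i]**n_i / n_i!, with Gamma(n + 1) used for n!."""
    n_total = sum(counts)
    ln_p = lgamma(n_total + 1)
    for n_i, p_i in zip(counts, probs):
        if n_i > 0:  # empty bins contribute a factor of 1
            ln_p += n_i * log(p_i) - lgamma(n_i + 1)
    return ln_p / log(10)

def unlikelihood(counts, probs):
    """r = -log10[P({n_i}) / P({n_i}_max)], with n_i = N * P(i) for the maximum."""
    n_total = sum(counts)
    most_probable = [n_total * p_i for p_i in probs]
    return log10_multinomial(most_probable, probs) - log10_multinomial(counts, probs)

author_a = [0, 0, 0, 0, 10, 0]  # ten papers with 100 citations each
author_b = [9, 0, 0, 0, 0, 1]   # nine uncited papers, one with 1000 citations

r_a = unlikelihood(author_a, P)
r_b = unlikelihood(author_b, P)
print(f"r(A) = {r_a:.2f}, r(B) = {r_b:.2f}")
```

Because the tabulated probabilities are rounded, the results should
agree with the values quoted above only to within a few hundredths.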

On the other hand, if one believes that the most 
highly cited papers have a special importance — that 
they contain scientific results that are particularly 
significant or noteworthy — one might reasonably pre- 
fer B over A. A famous proponent of this view is 
the father of bibliometrics, E. Garfield, who dubbed
such papers "citation classics" [5]. No amount of
quantitative research will convince a supporter of 
citation classics that the improbability of a citation 
record is a better measure of the scientific signifi- 
cance of an author or vice versa; this judgment is 
strictly subjective. 

3. DISCRIMINATORY ABILITY 

As mentioned above, scalar measures of author 
quality also contain an element that can be assessed 
objectively. Whatever the intrinsic and value-based 
merits of the measure, M, assigned to every author,
it will be of no practical value unless the
corresponding uncertainty, δM, is sufficiently small. From
this point of view, the "best" choice of measure will
be that which provides maximal discrimination
between scientists and hence the smallest value of δM.
If a measure cannot be assigned to a given author 
with suitable precision, the subjective issue of its re- 
lation to author quality is rendered moot. Below we 
outline how the question of deciding which of sev- 
eral proposed measures is most discriminating, and 
therefore "best," can be addressed quantitatively us- 
ing standard statistical methods. 

The model that authors A and B draw their citation
records on the total citation distribution P(i)
is quite primitive. This is indicated by the fact that 
the numerical values of r for both A and B are un- 
comfortably large. It is more reasonable to assume 
that each author's record was drawn on some sub- 
distribution of citations. By using various measures 
of author quality to construct such sub-distributions, 
we can gauge their discriminatory abilities. We for- 
malize this idea below. 

We start by binning all authors according to some
tentative indicator, M, obtained from their full
citation record. The probability that an author will lie in
bin α is denoted p(α). Similarly, we bin each paper
according to the total number of its citations.² The
full citation record for an author is simply the set
{n_i}. For each author bin, α, we then empirically
construct the conditional probability distribution,
P(i|α), that a single paper by an author in bin α
will lie in citation bin i. These conditional
probabilities are the central ingredient in our analysis. They
can be used to calculate the probability, P({n_i}|α),
that any full citation record was actually drawn at


¹ For example, it is possible to be an improbably bad author.



² We use Greek letters when binning with respect to M and
Roman for binning citations. 



random on the conditional distribution, P(i|α),
appropriate for a fixed author bin, α. Bayes' theorem
allows us to invert this probability to yield

$$P(\alpha \mid \{n_i\}) \propto P(\{n_i\} \mid \alpha)\, p(\alpha), \qquad (1)$$

where P(α|{n_i}) is the probability that the
citation record {n_i} was drawn at random from author
bin α. By considering the actual citation histories of
authors in bin β, we can thus construct the
probability, P(α|β), that the citation record of an author
initially assigned to bin β was drawn on the
distribution appropriate for bin α. In other words, we can
determine the probability that an author assigned to
bin β on the basis of the tentative indicator should
actually be placed in bin α. This allows us to
determine both the accuracy of the initial author
assignment and its uncertainty in a purely statistical
fashion.
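
The inversion in equation (1) can be sketched in a few lines. The
conditional distributions, the prior, and the citation record below
are invented for illustration and are not taken from SPIRES; the
function name is likewise ours. The permutation factor of the
multinomial is the same for every author bin, so it drops out once
the result is normalized.

```python
from math import exp, log

def posterior_over_bins(counts, cond_probs, prior):
    """P(alpha | {n_i}) from P({n_i} | alpha) * p(alpha), normalized over alpha."""
    log_weights = []
    for p_alpha, p_i_given_alpha in zip(prior, cond_probs):
        ll = log(p_alpha)
        for n_i, p_i in zip(counts, p_i_given_alpha):
            if n_i > 0:  # empty citation bins contribute nothing
                ll += n_i * log(p_i)
        log_weights.append(ll)
    top = max(log_weights)  # subtract the maximum before exponentiating
    weights = [exp(lw - top) for lw in log_weights]
    norm = sum(weights)
    return [w / norm for w in weights]

# Hypothetical P(i | alpha): rows are author bins, columns are citation bins.
cond_probs = [
    [0.70, 0.25, 0.05],
    [0.40, 0.45, 0.15],
    [0.15, 0.45, 0.40],
]
prior = [0.5, 0.3, 0.2]  # hypothetical p(alpha)
record = [3, 10, 7]      # n_i: papers in each citation bin

posterior = posterior_over_bins(record, cond_probs, prior)
best_bin = max(range(len(posterior)), key=posterior.__getitem__)
print(posterior, best_bin)
```

Averaging such posteriors over all authors initially placed in bin β
would give an estimate of P(α|β) in the sense described above.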

While a good choice of indicator will assign each 
author to the correct bin with high probability, this 
will not be the case for a poor measure. Consider 
extreme cases in which we elect to bin authors on 
the basis of indicators unrelated to scientific quality, 
e.g., by hair/eye color or alphabetically. For such
indicators, P(i|α) and P({n_i}|α) will be independent
of α, and P(α|{n_i}) will be proportional to the
prior distribution p(α). As a consequence, the
proposed indicator will have no predictive power
whatsoever. The utility of a given indicator (as indicated
by the statistical accuracy with which a value can
be assigned to any given author) will obviously be
enhanced when the basic distributions P(i|α)
depend strongly on α. These differences can be
formalized using the standard Kullback-Leibler
divergence. The method outlined above was applied to
several measures of author performance in [1, 2]. Some
familiar measures, including papers per year and the 
Hirsch index [6], do not reflect an author's full cita- 
tion record and are little better than a random rank- 
ing of authors. The most accurate measures (e.g., 
mean or median citations per paper) are able to 
assign authors to the correct decile bin with 90% 
confidence on the basis of approximately 50 papers. 
Since the accuracy of assignment grows exponen- 
tially with the number of papers, the evaluation of 
authors with significantly fewer papers is not likely 
to be useful. 
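
As a rough illustration of how the Kullback-Leibler divergence
separates informative from uninformative indicators, consider the
following sketch. The distributions are hypothetical, and the last
lines rely on the standard result that, for independent draws, the
expected log-likelihood ratio between two candidate distributions
grows linearly with the number of draws.

```python
from math import log

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) = sum_i p_i * ln(p_i / q_i), in nats."""
    return sum(p_i * log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

# Hypothetical P(i | alpha) for two author bins over three citation bins.
informative = [[0.70, 0.25, 0.05], [0.15, 0.45, 0.40]]
# An indicator unrelated to citations (e.g., alphabetical order) yields the
# same citation distribution in every author bin.
uninformative = [[0.40, 0.40, 0.20], [0.40, 0.40, 0.20]]

d_informative = kl_divergence(informative[0], informative[1])
d_uninformative = kl_divergence(uninformative[0], uninformative[1])

# For N papers drawn from bin 0, the expected log-odds in favour of bin 0
# over bin 1 is N * D, so the odds themselves grow roughly as exp(N * D).
papers = 50
expected_log_odds = papers * d_informative
print(d_informative, d_uninformative, expected_log_odds)
```

A divergence of zero means that no number of papers can distinguish
the two bins, which is the situation for the hair/eye color example.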

4. DATA HOMOGENEITY 

The average number of citations for a scientific 
paper varies significantly from field to field. A study
of the impact factors on Web of Science [7] shows
that an average paper in molecular biology and
biochemistry receives approximately 6 times more
citations than a paper in mathematics. Such
distinctions, which are unrelated to field size or
publication frequency, are entirely due to differences in the
accepted referencing practice which have emerged 
in separate scientific fields. It is obvious that a fair 
comparison of authors in different fields must rec- 
ognize and correct for such cultural inhomogeneities 
in the data. This task is more difficult than might 
be expected since significant differences in referenc- 
ing/citation practice can be found at a surprisingly 
microscopic level. Consider the following subfield hi- 
erarchy: 

physics → high energy physics → high energy theory → superstring theory.

Study of the SPIRES database reveals that the nat- 
ural assumption of identical referencing/citation pat- 
terns for string and non-string theory papers is grossly 
incorrect. Since its emergence in the 1980s, string 
theory has evolved into a distinct and largely self- 
contained subfield with its own characteristic ref- 
erencing practices. Specifically, our studies indicate 
that the average number of citations/references for 
string theory papers is now roughly twice that of 
non-string theory papers in theoretical high energy 
physics. Any attempt to compare string theorists 
with non-string theorists will be meaningless unless 
these non-homogeneities are recognized and taken 
into consideration. Unfortunately, such information 
is not usually supplied by or readily obtainable from 
commercial databases. 

5. IN SUMMARY 

The Committee's report provides a much needed 
criticism of common misuses of citation data. By 
attempting to separate issues that are amenable to 
statistical analysis from purely subjective issues, we 
hope to have shown that serious statistical analysis 
does have a place in a field that is currently domi- 
nated by ad hoc measures, rationalized by anecdo- 
tal examples and by comparisons with other ad hoc 
measures. The probabilistic methods outlined above 
permit meaningful comparison of scientists working
in distinct areas with minimal value judgments. It
seems fair, for example, to declare equality between
scientists in the same percentile of their peer groups.
It is similarly possible to combine probabilities in or- 
der to assign a meaningful ranking to authors with 
publications in several disjoint areas. All that is re- 
quired is knowledge of the conditional probabilities 
appropriate for each homogeneous subgroup. 
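
A minimal sketch of the percentile comparison mentioned above follows.
It assumes that some scalar measure (here, mean citations per paper)
has already been computed for every author in each homogeneous peer
group. All numbers are invented, with the two groups differing by a
factor of two in typical citation counts.

```python
from bisect import bisect_right

def percentile_in_group(value, group_values):
    """Percentage of the peer group whose measure is at or below `value`."""
    ordered = sorted(group_values)
    return 100.0 * bisect_right(ordered, value) / len(ordered)

# Hypothetical mean citations per paper for authors in two subfields.
string_theory = [8, 12, 20, 26, 30, 44, 52, 60, 90, 140]
non_string = [4, 6, 10, 13, 15, 22, 26, 30, 45, 70]

# Two authors with very different raw scores can share a percentile
# once each is compared only with her own peer group.
pct_string = percentile_in_group(52, string_theory)
pct_non_string = percentile_in_group(26, non_string)
print(pct_string, pct_non_string)
```

The raw values 52 and 26 would rank these two authors very
differently, whereas the within-group percentiles treat them alike.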

We emphasize that meaningful statistical analysis 
requires the availability of data sets of demonstrated 
homogeneity. The common tacit assumption of ho- 
mogeneity in the absence of evidence to the contrary 
is not tenable. Finally, we note that statistical anal- 
yses along the lines indicated here are capable of 
identifying groups of scientists with similar citation 
records in a manner which is both objective and of 
quantifiable accuracy. The interpretation of these ci- 
tation records and their relationship to intrinsic sci- 
entific quality remains a subjective and value-based 
issue. 